Add Deepseek 3.1 & Deepseek 3.1 base #2607

cheachu wants to merge 10 commits into keras-team:master
Conversation
Summary of Changes

Hello @cheachu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands Keras Hub's capabilities by integrating the advanced DeepSeek-V3.1 large language model. The implementation focuses on cutting-edge architectural components: Multi-head Latent Attention for efficient memory usage during inference, a highly optimized Mixture-of-Experts system for scalable computation, and advanced positional encoding techniques for handling longer contexts. This addition provides Keras users with a powerful, state-of-the-art language model that is fully compatible with Keras 3's multi-backend design.

Highlights
Code Review
The pull request introduces the DeepSeek-V3.1 model architecture to Keras Hub, including Multi-head Latent Attention (MLA), DeepSeekMoE, YaRN RoPE Scaling, RMSNorm, SwiGLU FFN, Tokenizer, and CausalLM. A critical security vulnerability was identified in the DeepSeekV31Tokenizer class where the proto argument, accepting a file path to a SentencePiece model, is loaded without checking Keras's safe_mode. This bypasses security protections against arbitrary file reads during model deserialization. It is recommended to include the necessary safe_mode check. Additionally, there are minor issues related to naming conventions in the presets file and a potential for improved clarity in the _yarn_inv_freq method's constant.
```python
if proto is not None:
    try:
        import sentencepiece as spm

        sp = spm.SentencePieceProcessor()
        sp.Load(proto)
```
The proto argument allows loading a SentencePiece model from an arbitrary file path. This bypasses Keras's safe_mode protection, which is intended to prevent arbitrary file reads during deserialization of untrusted models. The parent class BytePairTokenizer implements a check for safe_mode when loading from a path, and DeepSeekV31Tokenizer should do the same for the proto argument to maintain a consistent security posture.
Suggested change:

```diff
 if proto is not None:
+    from keras.src.saving import serialization_lib
+
+    if isinstance(proto, str) and serialization_lib.in_safe_mode():
+        raise ValueError(
+            "Requested the loading of a SentencePiece proto file outside of the "
+            "model archive. This carries a potential risk of loading "
+            "arbitrary and sensitive files and thus it is disallowed "
+            "by default. If you trust the source of the artifact, you "
+            "can override this error by passing `safe_mode=False` to "
+            "the loading function, or calling "
+            "`keras.config.enable_unsafe_deserialization()`."
+        )
     try:
         import sentencepiece as spm

         sp = spm.SentencePieceProcessor()
         sp.Load(proto)
```
```python
# Wavelength = 2π / freq. High-freq → small wavelength, low-freq →
# large wavelength. YaRN applies more scaling to low-freq dimensions.
wavelengths = 2.0 * 3.14159265358979 / freqs
```
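To make the wavelength rule in that comment concrete, here is a simplified, framework-free sketch (my illustration, not the PR's code): it uses a hard wavelength threshold where real YaRN blends the two branches with a smooth ramp, and all names and default values here are hypothetical.

```python
import math

def yarn_inv_freq(rotary_dim=8, base=10000.0, scaling_factor=4.0,
                  original_max_len=1024):
    # Base RoPE inverse frequency for each dimension pair.
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]
    scaled = []
    for f in inv_freq:
        wavelength = 2.0 * math.pi / f
        if wavelength < original_max_len:
            # High-frequency dims: the full wave fits inside the trained
            # context window, so leave them untouched.
            scaled.append(f)
        else:
            # Low-frequency dims: stretch (interpolate) by the scaling
            # factor so long-context positions map into the trained range.
            scaled.append(f / scaling_factor)
    return scaled
```

Real YaRN interpolates between the two branches over a band of wavelengths rather than switching abruptly at a single threshold.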
```python
# Metadata for loading pretrained model weights and configurations.
backbone_presets = {
    "deepseek_v3_base": {
```
The preset name deepseek_v3_base does not align with the model name DeepSeekV31 used throughout the codebase. The repository's naming conventions (Rule 55) state that preset names should be in snake_case and follow the pattern <model_name>_<component_type>.py. To maintain consistency, this should be deepseek_v31_base.
Suggested change:

```python
backbone_presets = {
    "deepseek_v31_base": {
```

```python
            "37B activated parameters."
        ),
        "params": 671000000000,
        "path": "deepseek_v3",
        "model_type": "MoE",
        "tokenizer": "DeepSeekV3Tokenizer",
```
```python
preprocessor_presets = {
    "deepseek_v3_base": {
        "metadata": {
            "description": "DeepSeek V3 preprocessor.",
            "path": "deepseek_v3",
        },
        "kaggle_handle": "kaggle://deepseek-ai/deepseek-v3/preprocessor/1",
    },
```
The preset name deepseek_v3 does not align with the model name DeepSeekV31 used throughout the codebase. The repository's naming conventions (Rule 55) state that preset names should be in snake_case and follow the pattern <model_name>_<component_type>.py. To maintain consistency, this should be deepseek_v31.
Suggested change:

```diff
     },
-    "deepseek_v3": {
+    "deepseek_v31": {
         "metadata": {
             "description": "DeepSeek V3 preprocessor.",
             "path": "deepseek_v3",
         },
```
/gemini review
Code Review
This is an impressive and comprehensive contribution, adding the DeepSeek-V3.1 model to Keras Hub. The implementation is well-structured, follows the repository's modular design, and makes good use of Keras 3 features for backend-agnostic code. The vectorized MoE layer and MLA implementation are particularly well done. I have a few suggestions to improve code clarity, consistency, and testing, primarily related to adhering to the style guide and fixing an empty test case. Overall, great work on this complex model implementation.
```python
def test_backbone_basics(self):
    original_assert_dtype_equal = self.assertDTypeEqual

    def assert_dtype_flexible(tensor, expected_dtype, msg=None):
        actual_dtype = str(tensor.dtype)
        allowed_dtypes = ["float16", "bfloat16"]
        if actual_dtype not in allowed_dtypes:
            self.fail(
                msg
                or f"Tensor dtype {actual_dtype} not in allowed {allowed_dtypes}"
            )
```
This test case test_backbone_basics is currently empty and does not perform any checks. It should be implemented to validate the backbone's basic functionality using self.run_backbone_test, as shown in the repository's testing guidelines (lines 462-468).
```python
def test_backbone_basics(self):
    self.run_backbone_test(
        cls=DeepSeekV31Backbone,
        init_kwargs=self.init_kwargs,
        input_data=self.input_data,
        expected_output_shape=(2, 5, 64),
    )
```

References
- The style guide requires using helper methods like `self.run_backbone_test()` to verify basic usage and shape inference for backbone models. (link)
```python
# Wavelength = 2π / freq. High-freq → small wavelength, low-freq →
# large wavelength. YaRN applies more scaling to low-freq dimensions.
PI = 3.14159265358979
```
```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
```
The __init__ method should have explicit arguments instead of *args, **kwargs to improve clarity and align with the repository's style guide (see lines 233-240 of CONTRIBUTING_MODELS.md).
Suggested change:

```diff
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
+    def __init__(
+        self,
+        tokenizer,
+        sequence_length=1024,
+        add_start_token=True,
+        add_end_token=True,
+        **kwargs,
+    ):
+        super().__init__(
+            tokenizer=tokenizer,
+            sequence_length=sequence_length,
+            add_start_token=add_start_token,
+            add_end_token=add_end_token,
+            **kwargs,
+        )
```
References
- The style guide provides an example for Preprocessor `__init__` methods with explicit arguments for tokenizer and sequence_length. (link)
```python
# Normalize only the selected K scores (eq. 13).
top_k_weights = top_k_scores / (
    ops.sum(top_k_scores, axis=-1, keepdims=True) + 1e-9
)
```
The epsilon value 1e-9 is hardcoded. For consistency with other layers like DeepSeekV31RMSNorm and for better maintainability, consider making this a configurable parameter in the __init__ method with a default value. This would involve adding an epsilon argument to __init__ and updating get_config.
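A minimal, framework-free sketch of that suggestion (the class and method names here are hypothetical, not the PR's actual layer): the epsilon becomes a constructor argument and is surfaced in `get_config` so it survives serialization round-trips.

```python
import numpy as np

class TopKRouterNorm:
    """Hypothetical helper showing a configurable epsilon instead of a
    hardcoded 1e-9 in the top-k score normalization."""

    def __init__(self, epsilon=1e-9):
        self.epsilon = epsilon

    def normalize(self, top_k_scores):
        # Normalize only the selected K scores; epsilon guards against
        # division by zero when all selected scores are zero.
        denom = top_k_scores.sum(axis=-1, keepdims=True) + self.epsilon
        return top_k_scores / denom

    def get_config(self):
        # Mirrors the Keras get_config pattern so the value round-trips.
        return {"epsilon": self.epsilon}
```

In a real Keras layer the same two changes would land in `__init__` and `get_config` of the MoE gating layer.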
… format checks + removed the hardcoded pi and epsilon values
/gemini review
Code Review
This pull request introduces a comprehensive and high-quality implementation of the DeepSeek-V3.1 model, including its novel Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) components. The code is well-structured, follows the repository's style guide closely, and includes thorough testing. My feedback is minor and focuses on a couple of style improvements to enhance code consistency and readability.
```python
# Wavelength = 2π / freq. High-freq → small wavelength, low-freq →
# large wavelength. YaRN applies more scaling to low-freq dimensions.
import math
```
The math module is imported locally within the _yarn_inv_freq method. According to Python style guidelines (PEP 8), imports should be placed at the top of the file. Please move import math to the top of the file to make dependencies clear and avoid re-importing.
References
- The repository style guide requires using `ruff` for code formatting (line 736). `ruff` would flag this as an out-of-place import (E402). Moving imports to the top of the file is a standard Python convention (PEP 8) that improves readability by making dependencies clear at a glance. (link)
Hi @sachinprasadhs 👋
Description of the change
This PR introduces the DeepSeek-V3.1 model architecture to Keras Hub, fully compliant with the Keras 3 backend-agnostic design (TensorFlow, JAX, PyTorch).
Key architectural components implemented in this PR:

- Vectorized Mixture-of-Experts layer built on `ops.einsum` and batched kernel tensors, ensuring it is 100% XLA-compatible (`jit_compile=True`) and avoids the severe graph bloat associated with iterating over 256 experts.
- `CausalLM` and `CausalLMPreprocessor` wrappers for text generation.

Reference
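The vectorized-MoE point above can be illustrated with a small NumPy sketch (illustrative only — shapes, names, and the uniform router are my assumptions, not the PR's implementation): all expert kernels live in one batched tensor, so a couple of einsums replace a Python loop over experts.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, d_ff, n_tokens = 4, 8, 16, 10

tokens = rng.standard_normal((n_tokens, d_model))
w_in = rng.standard_normal((num_experts, d_model, d_ff))
w_out = rng.standard_normal((num_experts, d_ff, d_model))

# Every token through every expert at once; no per-expert Python loop,
# so a compiler like XLA sees a few large ops instead of hundreds of small ones.
hidden = np.maximum(np.einsum("td,edf->tef", tokens, w_in), 0.0)  # ReLU stand-in
expert_out = np.einsum("tef,efd->ted", hidden, w_out)

# Combine experts with router weights (uniform here for illustration;
# the real model uses learned, sparse top-k routing).
router = np.full((n_tokens, num_experts), 1.0 / num_experts)
output = np.einsum("te,ted->td", router, expert_out)
```

With uniform weights the combination reduces to a plain mean over the expert axis, which makes the dispatch easy to sanity-check.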
Checklist